Model-based clustering of categorical data based on the Hamming distance
A model-based approach is developed for clustering categorical data with no
natural ordering. The proposed method exploits the Hamming distance to define a
family of probability mass functions to model the data. The elements of this
family are then considered as kernels of a finite mixture model with an
unknown number of components. Conjugate Bayesian inference is derived for the
parameters of the Hamming distribution model. The mixture is framed in a
Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is
developed to provide full Bayesian inference on the number of clusters, their
structure, and the group-specific parameters, easing computation relative
to customary reversible jump algorithms. The proposed model encompasses
a parsimonious latent class model as a special case when the number of
components is fixed. Model performance is assessed via a simulation study and
reference datasets, showing improvements in clustering recovery over existing
approaches.
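The abstract leaves the kernel implicit; a minimal sketch of a Hamming-distance kernel of the general kind described (the paper's exact parametrization may differ, and the decay parameter `lam` is a hypothetical stand-in for the dispersion parameter) could look like:

```python
import math
from itertools import product

def hamming(x, c):
    """Number of positions where two categorical sequences differ."""
    return sum(xi != ci for xi, ci in zip(x, c))

def hamming_pmf(x, center, lam, n_attrs, n_cats):
    """Toy kernel: p(x | center, lam) proportional to exp(-lam * d_H(x, center)).

    The normalizing constant sums over all possible distances d:
    sum_d C(p, d) * (q - 1)^d * exp(-lam * d), for p attributes with q levels each.
    """
    norm = sum(math.comb(n_attrs, d) * (n_cats - 1) ** d * math.exp(-lam * d)
               for d in range(n_attrs + 1))
    return math.exp(-lam * hamming(x, center)) / norm

# Sanity check: the pmf sums to one over all q^p categorical sequences.
p, q, lam = 3, 2, 0.7
center = (0, 0, 0)
total = sum(hamming_pmf(x, center, lam, p, q)
            for x in product(range(q), repeat=p))
```

Because the kernel depends on the data only through the Hamming distance to the center, mass decays geometrically with distance, which is what makes conjugate updates tractable.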
Mixture modeling via vectors of normalized independent finite point processes
Statistical modeling in the presence of hierarchical data is a crucial task in
Bayesian statistics. The Hierarchical Dirichlet Process (HDP) is the
reference tool for handling data organized in groups through mixture modeling.
Although the HDP is mathematically tractable, its computational cost is
typically demanding, and its analytical complexity represents a barrier for
practitioners. The present paper introduces a mixture model based on a novel
family of Bayesian priors designed for multilevel data, obtained by
normalizing a finite point process. A full distribution theory for this new
family and the induced clustering is developed, including tractable expressions
for marginal, posterior and predictive distributions. Efficient marginal and
conditional Gibbs samplers are designed for providing posterior inference. The
proposed mixture model outperforms the HDP in terms of analytical tractability,
clustering discovery, and computational time. The motivating application comes
from the analysis of shot put data, which contains performance measurements of
athletes across different seasons. In this setting, the proposed model is
used to induce clustering of the observations across seasons and athletes.
By linking clusters across seasons, similarities and differences in athletes'
performances are identified.
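A minimal sketch of the normalization step that gives the prior family its name, with illustrative Poisson and gamma choices that are assumptions rather than the paper's actual specification:

```python
import numpy as np

def normalized_finite_weights(rng, poisson_rate=3.0, gamma_shape=1.0):
    """Sketch: mixture weights from a normalized finite point process.

    Draw a finite random number of support points (at least one), attach
    independent unnormalized gamma jumps, and normalize the jumps into
    mixture weights. The Poisson/gamma choices here are illustrative only.
    """
    m = 1 + rng.poisson(poisson_rate)       # finite, random number of atoms
    jumps = rng.gamma(gamma_shape, size=m)  # independent unnormalized jumps
    return jumps / jumps.sum()              # normalization step

rng = np.random.default_rng(0)
w = normalized_finite_weights(rng)
```

The key contrast with the HDP is that the number of atoms is finite and the jumps are independent, which is what keeps marginal and posterior expressions tractable.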
Gaussian graphical modeling for spectrometric data analysis
Motivated by the analysis of spectrometric data, we introduce a Gaussian
graphical model for learning the dependence structure among frequency bands of
the infrared absorbance spectrum. The spectra are modeled as continuous
functional data through a B-spline basis expansion and a Gaussian graphical
model is assumed as a prior specification for the smoothing coefficients to
induce sparsity in their precision matrix. Bayesian inference is carried out to
simultaneously smooth the curves and to estimate the conditional independence
structure between portions of the functional domain. The proposed model is
applied to the analysis of infrared absorbance spectra of strawberry purees.
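The key device here, zeros in the precision matrix encoding conditional independence between smoothing coefficients, can be illustrated on a toy chain graph (the tridiagonal structure below is an illustrative assumption, not the model fitted in the paper):

```python
import numpy as np

# Tridiagonal precision matrix for 4 coefficients: zeros off the first
# band encode conditional independence between non-adjacent coefficients,
# as one might assume for smoothing coefficients of neighboring bands.
K = (np.diag(np.full(4, 2.0))
     + np.diag(np.full(3, -0.8), 1)
     + np.diag(np.full(3, -0.8), -1))
Sigma = np.linalg.inv(K)  # implied covariance matrix

# Marginally, coefficients 0 and 2 are correlated through coefficient 1...
marginal_corr_02 = Sigma[0, 2] / np.sqrt(Sigma[0, 0] * Sigma[2, 2])

# ...but their partial correlation, read directly off the precision
# matrix, is exactly zero: K[0, 2] == 0.
partial_corr_02 = -K[0, 2] / np.sqrt(K[0, 0] * K[2, 2])
```

This is why sparsity is placed on the precision matrix rather than the covariance: a zero entry there corresponds directly to conditional independence between portions of the functional domain.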
Dynamic model-based clustering for spatio-temporal data
In many research fields, scientific questions are investigated by analyzing data collected over space and time, usually at fixed spatial locations and time steps, resulting in geo-referenced time series. In this context, it is of interest to identify potential partitions of the space and study their evolution over time. A finite space-time mixture model is proposed to identify level-based clusters in spatio-temporal data and study their temporal evolution along the time frame. We account for space-time dependence by introducing spatio-temporally varying mixing weights that allocate observations at nearby locations and consecutive time points with similar cluster membership probabilities. As a result, a clustering varying over time and space is obtained. Conditionally on cluster membership, a state-space model is deployed to describe the temporal evolution of the sites belonging to each group. Full posterior inference is provided under a Bayesian framework through Markov chain Monte Carlo algorithms. Also, a strategy to select a suitable number of clusters based on the posterior temporal patterns of the clusters is offered. We evaluate our approach through simulation experiments and illustrate it using air quality data collected across Europe from 2001 to 2012, showing the benefit of borrowing strength across space and time.
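A toy sketch of spatio-temporally varying mixing weights of this flavor, with hypothetical latent fields (spatial smoothing across neighboring sites is omitted for brevity; only the slowly varying time component is shown):

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n_sites, n_times, n_clusters = 5, 4, 3

# Hypothetical latent fields: a site-specific effect plus a slowly varying
# linear time trend, so consecutive time points at a given site receive
# similar cluster membership probabilities.
site_effect = rng.normal(size=(n_sites, 1, n_clusters))
time_trend = (np.linspace(0.0, 1.0, n_times)[None, :, None]
              * rng.normal(size=(1, 1, n_clusters)))

weights = softmax(site_effect + time_trend)  # shape (sites, times, clusters)
```

Each site-time pair gets its own probability vector over clusters, which is what allows the partition itself to drift smoothly over the time frame.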
Estimate of overdiagnosis of breast cancer due to mammography after adjustment for lead time. A service screening study in Italy
INTRODUCTION: An excess of incidence rates is the expected consequence of service screening. The aim of this paper is to estimate the proportion attributable to overdiagnosis in the breast cancer screening programmes in Northern and Central Italy. METHODS: All patients with breast cancer diagnosed at ages 50 to 74 who were resident in screening areas in the six years before and five years after the start of the screening programme were included. We calculated a corrected-for-lead-time number of observed cases for each calendar year: the number of observed incident cases was reduced by the number of screen-detected cases in that year and incremented by the estimated number of screen-detected cases that would have arisen clinically in that year. RESULTS: In total, we included 13,519 and 13,999 breast cancer cases diagnosed in the pre-screening and screening years, respectively. Overall, the excess ratio of observed to predicted in situ and invasive cases was 36.2%. After correction for lead time the excess ratio was 4.6% (95% confidence interval 2% to 7%), and for invasive cases only it was 3.2% (95% confidence interval 1% to 6%). CONCLUSION: The remaining excess of cancers after individual correction for lead time was lower than 5%.
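The lead-time correction described in METHODS amounts to simple per-year bookkeeping; a sketch with hypothetical counts (not the study's actual figures):

```python
def corrected_observed(observed, screen_detected, would_have_surfaced):
    """Lead-time correction for one calendar year, as described above:
    subtract the screen-detected cases, then add back the estimated number
    of screen-detected cases that would have arisen clinically that year."""
    return observed - screen_detected + would_have_surfaced

def excess_ratio(observed_total, predicted_total):
    """Percent excess of observed over predicted incidence."""
    return 100.0 * (observed_total - predicted_total) / predicted_total

# Hypothetical illustration (these counts are made up for the example):
corr = corrected_observed(observed=1200, screen_detected=400,
                          would_have_surfaced=250)
ratio = excess_ratio(observed_total=corr, predicted_total=1000)
```

The correction removes the artificial surplus created by diagnosing cases earlier than they would have surfaced, so any excess that remains is a candidate estimate of overdiagnosis.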
Breast screening: axillary lymph node status of interval cancers by interval year
The aim of this study was to determine whether the excess risk of axillary lymph node metastases (N+) differs between interval breast cancers arising shortly after a negative mammography and those presenting later. In a registry-based series of pT1a–pT3 breast carcinoma patients aged 50–74 years from the Italian screening programmes, the odds ratio (OR) for interval cancers (n = 791) versus screen-detected (SD) cancers (n = 1211) having N+ was modelled using forward stepwise logistic regression analysis. The interscreening interval was divided into 1–12, 13–18, and 19–24 months. The prevalence of N+ was 28% among SD cancers. With a prevalence of 38%, 42%, and 44%, the adjusted (demographics and N staging technique) ORs of N+ for cancers diagnosed at 1–12, 13–18, and 19–24 months of interval were 1.41 (95% confidence interval 1.06–1.87), 1.74 (1.31–2.31), and 1.91 (1.43–2.54), respectively. Histologic type, tumour grade, and tumour size were entered in turn into the model. Histologic type had modest effects. With adjustment for tumour grade, the ORs decreased to 1.23 (0.92–1.65), 1.58 (1.18–2.12), and 1.73 (1.29–2.32). Adjusting for tumour size decreased the ORs to 0.95 (0.70–1.29), 1.34 (0.99–1.81), and 1.37 (1.01–1.85). The strength of confounding by tumour size suggested that the excess risk of N+ for first-year interval cancers reflected only their greater chronological age, whereas the increased aggressiveness of second-year interval cancers was partly accounted for by intrinsic biological attributes.
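As a point of reference, the crude (unadjusted) odds ratio implied by the reported prevalences can be computed directly; it differs from the adjusted 1.41 because the fitted model also conditions on demographics and N staging technique:

```python
def odds(p):
    """Convert a prevalence (proportion) to odds."""
    return p / (1.0 - p)

def odds_ratio(p_exposed, p_reference):
    """Crude (unadjusted) odds ratio of node positivity."""
    return odds(p_exposed) / odds(p_reference)

# Prevalences reported above: 28% N+ among screen-detected cancers,
# 38% among interval cancers diagnosed within 12 months of screening.
crude_or = odds_ratio(0.38, 0.28)  # roughly 1.58
```

The gap between the crude 1.58 and the adjusted 1.41 reflects exactly the kind of confounding the stepwise analysis was designed to probe.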
Clinical epigenetics settings for cancer and cardiovascular diseases: real-life applications of network medicine at the bedside
Despite impressive efforts invested in epigenetic research in the last 50 years, clinical applications are still lacking. Only a few university hospital centers currently use epigenetic biomarkers at the bedside. Moreover, the overall concept of precision medicine is not widely recognized in routine medical practice, and the reductionist approach remains predominant in treating patients affected by major diseases such as cancer and cardiovascular diseases. By its very nature, epigenetics is integrative of genetic networks. The study of epigenetic biomarkers has led to the identification of numerous drugs with an increasingly significant role in clinical therapy, especially for cancer patients. Here, we provide an overview of clinical epigenetics within the context of network analysis. We illustrate achievements to date and discuss how we can move from traditional medicine into the era of network medicine (NM), where pathway-informed molecular diagnostics will allow treatment selection following the paradigm of precision medicine.
Bayesian space-time data fusion for real-time forecasting and map uncertainty
Environmental computer models are deterministic models devoted to predicting environmental phenomena such as air pollution or meteorological events. Numerical model output is given in terms of averages over grid cells, usually at high spatial and temporal resolution. However, these outputs are often biased, with unknown calibration, and are not equipped with any information about the associated uncertainty. Conversely, data collected at monitoring stations are more accurate, since they essentially provide the true levels. Given the leading role played by numerical models, it is now important to compare model output with observations. Statistical methods developed to combine numerical model output and station data are usually referred to as data fusion.
In this work, we first combine ozone monitoring data with ozone predictions from the Eta-CMAQ air quality model in order to forecast in real time the current 8-hour average ozone level, defined as the average of the previous four hours, the current hour, and the predictions for the next three hours. We propose a Bayesian downscaler model based on first differences with a flexible coefficient structure and an efficient computational strategy to fit the model parameters. Model validation for the eastern United States shows consequential improvement of our fully inferential approach compared with the current real-time forecasting system. Furthermore, we consider the introduction of temperature data from a weather forecast model into the downscaler, showing improved real-time ozone predictions.
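The 8-hour average being forecast is a simple sliding window over observed and predicted hours; a sketch with hypothetical hourly values:

```python
def forecast_8h_average(past_4h, current, next_3h):
    """Current 8-hour average ozone as defined above: the mean of the
    previous four hourly values, the current hour, and the three
    forecast hours that follow."""
    assert len(past_4h) == 4 and len(next_3h) == 3
    window = list(past_4h) + [current] + list(next_3h)
    return sum(window) / len(window)

# Hypothetical hourly ozone values (ppb), purely for illustration:
avg = forecast_8h_average([60, 62, 64, 66], 68, [70, 72, 74])
```

Because three of the eight hours are model forecasts rather than observations, the quality of the downscaler's short-horizon predictions feeds directly into this regulatory quantity.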
Finally, we introduce a hierarchical model to obtain spatially varying uncertainty associated with numerical model output. We show how we can learn about such uncertainty through suitable stochastic data fusion modeling using some external validation data. We illustrate our Bayesian model by providing the uncertainty map associated with a temperature output over the northeastern United States.
Quantifying uncertainty associated with a numerical model output
Environmental numerical models are deterministic tools widely used to simulate and predict complex systems. However, they are unsatisfactory in that they do not provide information about the uncertainty associated with their predictions. Conversely, uncertainty assessment of model outputs can be useful to guide environmental agencies in improving computer models. We propose a Bayesian hierarchical model to obtain spatially varying uncertainty associated with a numerical model output. We show how we can learn about such uncertainty through suitable stochastic data fusion modeling using some external validation data. The model is illustrated by providing the uncertainty map associated with a temperature output over the northeastern United States.